The described program analyzes and isolates equipment faults concur- 
rently with regular processing. 


If necessary, the program replaces system elements by realigning com- 
munication and control paths. 


Dependence of the program’s replacement decisions upon the recording 
of extensive error statistics is also discussed. 


An application-oriented multiprocessing system 


IV The operational error analysis program 
by D. C. Lancto and R. L. Rockefeller 


The use of multiprocessing in the Federal Aviation Administration’s
(FAA’s) air traffic control application has necessitated the
development of a new type of program, the Operational Error
Analysis Program (OEAP). Such a program performs on-line analyses
of equipment failure indications concurrently with regular process- 
ing, and assesses the source and seriousness of each potential mal- 
function. If necessary, the program realigns the communication and 
control paths in the multi-element system to functionally replace 
failing elements by redundant elements. 

Historically, major repair and preventive-maintenance activi- 
ties have required dedication of the system to these particular jobs. 
In multiprocessing systems, this is expensive; and in the case of 
real-time applications, such as air traffic control, it may be pro- 
hibited by the application requirements for system availability. 
The 9020 system (see Part II of this paper) is potentially able to 
monitor its own errors, isolate failures to an element of the system, 
and functionally replace failing elements by redundant elements. 

In response to application requirements, the equipment design
of the 9020 system includes special provisions for configuration
control, inter-element error indications, error-checking and logout
facilities, and address translation. The system-program design
includes the OEAP, system checkpoints in the application programs,
and unit and system diagnostic procedures for off-line maintenance.
All of these features contribute to the high availability of the 9020
system design. This Part of the paper concentrates on a description
of OEAP, mentioning other 9020 control features only as they relate
to OEAP.

Figure 1 suggests the broad relationship between 9020 elements and
the main system programs. The basic provisions for error checking
and identification, as well as the prerequisites for program control
of system configuration, are designed into the equipment. For the
sake of efficiency, the control program was given responsibility for
checkpoint and restart functions; hence OEAP is called into use only
when an equipment error is reported. OEAP then analyzes the error
environment; if the error persists, OEAP isolates the malfunctioning
element and removes it from the operational system. After reporting
its findings and actions to the control program, OEAP makes
any other configuration changes directed by that program. Data
that may help to pinpoint the error within the removed element
are printed for the maintenance personnel.

The 9020 system consists of seven types of equipment elements: 
Computing Element, Storage Element, Input/Output Control Ele- 
ment, Tape Control Unit, Tape Drive, Systems Console, and Pe- 
ripheral Adapter Module. Each of these elements has been designed 
so that a stored program executed in any Computing Element can 
control system operation, monitor any error situations, and permit 
system reconfigurations. There is no built-in master-slave relation- 
ship; any relationships between Computing Elements must exist 
under the surveillance of the particular Computing Element that is 
running the applicable sections of the control program. 

The configuration-control feature allows program control over 
system configurations. Element states, data paths, and control 
paths are dynamically specified and activated by OEAP. Manual
requests for system reconfigurations can be transmitted to the
program via a typewriter. Configuration control allows the formation
of complete and separate subsystems; these subsystems can be used 
in program debugging, miscellaneous production work, and sched- 
uled or unscheduled maintenance—as well as in the main applica- 
tion. 

The various system elements contain a variety of checks on
data paths, control paths, and environmental conditions (e.g.,
temperature sensors). The error-handling design philosophy divides
such checks into two categories: (1) those that can be handled with
normal I/O techniques by the Computing Element that initiated
the operation, and (2) those that must be indicated to the system
(in this case, “system” is defined as all Computing Elements that
are set up to “listen” to such errors). The first category embraces
errors for which the pertinent error environment can be obtained
through the normal I/O sense commands. A Peripheral Adapter
Module data check, for example, would fall in this category.

On the other hand, a power-failure check from a Peripheral 
Adapter Module would belong in the second category. System 
checks, which can occur at any time (as contrasted with data 
checks, which always occur during operational use of a unit), are 
transmitted to the diagnose-accessible register in each Computing 
Element. Because its register is maskable, a Computing Element 
can selectively accept or ignore indications of system errors.

A process that preserves element information is said to log the 
information, and the information logged is called a logout. The 
ability of the equipment to log the system environment whenever a 
malfunction occurs is essential to the OEAP error-analysis function.
Each Computing Element has the ability to log many of its own 
registers, as well as all important registers in Storage Elements, 
I/O Control Elements, and other Computing Elements; data from 
Computing Elements and I/O Control Elements go into the pref- 
erential-storage area that is controlled by the Computing Element. 
Whenever an I/O Control Element error condition is detected, the 
I/O Control Element automatically logs after “permission” is
received from a Computing Element. A Computing Element logs
automatically when its error logic detects a malfunction, whereas a 
Storage Element logs under control of the stored program in a 
Computing Element. For detailed error information from a
Peripheral Adapter Module or Tape Control Unit, the Computing
Element depends on sense and status data obtained through the
normal I/O means. Thus, complete system environment data are
available to OEAP for error-analysis purposes.

In the 9020 system, the address translation feature controls the 
logical assignment of addresses in each Storage Element. Address 
translation registers are loaded by the SET ADDRESS TRANS- 
LATOR instruction. Address bands of 32,768 words may be assigned 
to any Storage Element using the address translation feature. Thus, 
when the system loses a Storage Element, the reconfiguration proc- 
ess can fill an address gap with no program relocations beyond 
those needed to correctly load the new Storage Element. 
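
The band-reassignment idea can be illustrated with a short Python
sketch. It is only a toy model under stated assumptions: the class
and method names are hypothetical, Storage Elements are represented
by name strings, and the actual register format loaded by SET
ADDRESS TRANSLATOR is not modeled. Only the band size of 32,768
words comes from the text.

```python
BAND_WORDS = 32_768  # size of one address band, per the text


class AddressTranslator:
    """Toy model: maps each logical address band to a Storage Element name."""

    def __init__(self, band_to_se):
        self.band_to_se = dict(band_to_se)

    def locate(self, address):
        """Return (storage_element, offset_within_band) for a word address."""
        band, offset = divmod(address, BAND_WORDS)
        return self.band_to_se[band], offset

    def replace_element(self, failed, spare):
        """Reassign every band held by `failed` to `spare`.

        Logical addresses are unchanged, so programs need no relocation.
        """
        for band, se in self.band_to_se.items():
            if se == failed:
                self.band_to_se[band] = spare


xlat = AddressTranslator({0: "SE1", 1: "SE2"})
before = xlat.locate(40_000)        # address falls in band 1
xlat.replace_element("SE2", "SE7")  # hypothetical spare takes over band 1
after = xlat.locate(40_000)
```

In this model the same logical address resolves to the spare element
after the swap; the contents of the new Storage Element would still
have to be loaded, as the text notes.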

The checkpoint subprogram, part of the operational address- 
translation program, operates every thirty seconds and records 
about 56,000 words on magnetic tape. All dynamic tables, as well 
as the address translation registers, are recorded. Whenever it gains 
control, the checkpoint subprogram “locks up” each table; no table
is recorded until all tables are locked, thus guaranteeing that the
latest information is recorded. As each table is recorded (the
most-used tables are recorded first), it is unlocked so that operational
processing may resume. The time required to complete a
checkpoint is about three seconds.
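
The lock-all, then record-and-unlock order described above can be
sketched as follows. This is an illustration only, not the 9020
code: Python threading locks stand in for the table lock-up
mechanism, the table names are invented, and the `write` callback is
a hypothetical stand-in for the tape-recording routine.

```python
import threading


def checkpoint(tables, write):
    """Lock every table, then record and unlock them one at a time.

    `tables` is a list of (name, lock, data) ordered most-used first.
    """
    held = []
    try:
        for _name, lock, _data in tables:   # phase 1: lock every table
            lock.acquire()
            held.append(lock)
        for name, lock, data in tables:     # phase 2: record, then release
            write(name, list(data))         # snapshot taken while locked
            lock.release()
            held.remove(lock)
    finally:
        for lock in held:                   # release anything still held
            lock.release()


tape = []
tables = [
    ("tracks", threading.Lock(), [1, 2, 3]),        # hypothetical table names
    ("flight_plans", threading.Lock(), [4, 5]),
]
checkpoint(tables, lambda name, snap: tape.append((name, snap)))
```

Because no table is written before all are locked, the recorded set
is mutually consistent; releasing each table as soon as it is written
lets work on the most-used tables resume first.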

Because configuration control allows the setup of isolated sub- 
systems within the 9020 system, maintenance functions can employ 
the full complement of unit and system off-line diagnostics pre- 
pared for factory and acceptance-test checkouts. The need for sep- 
arately designed on-line diagnostic programs has been mitigated 
by this maintenance approach. The multi-element nature of the 
system, combined with the need for choosing maintenance sub- 
systems from among many possible choices, dictated an approach 
that places OEAP within the control program framework.

The bulk of the unit diagnostic code was obtained from standard 
SYSTEM/360 modules. A comprehensive multiprocessing diagnostic 
control program that operates single or multiple Computing Ele- 
ments in concurrent fashion was designed and written. A system 
evaluation program has been provided to check out system paths 
in the multiprocessor environment. Unit diagnostics for the Pe- 
ripheral Adapter Module and diagnostics for distinctive 9020 fea- 
tures were generated by the FAA project group. 


Program objectives 


The main functional objectives of the Operational Error Analysis 
Program are: 


• Error-check analysis and fault isolation
• Maintenance of error statistics
• Error environment reporting
• Reconfiguration


The bulk of OEAP is devoted to analyzing the error-check
environment and to isolating, if possible, the malfunctioning element
or interface (an interface being defined as the equipment between
the points at which error checking stops in one element and begins
in another element that is communicating with the first element).
Malfunctions give rise to abnormal-condition signals of three
possible kinds: element checks (ELC’s), out-of-tolerance checks (OTC’s),
and on-battery signals (OBS’s). ELC signals can be presented to the
diagnose-accessible registers of the Computing Elements in two
forms. A pulsed ELC is of short duration and is presented once per
appearance of a check condition; after issuing a pulsed ELC, the
issuing element attempts to continue operation. A level ELC, on the
other hand, is continuously presented to the diagnose-accessible
registers until the check condition is cleared in the issuing element.
A level ELC from an element indicates that the element can
proceed no further without external help.

Errors are classified as solid or intermittent by the error-analysis
programs. A distinction is possible in most cases because solid error
conditions show up as level ELC’s in the appropriate error register.
Normally, the reading of an error register clears the register, but if
the error is solid, the error bit persists in the “on” condition at
successive readings. Error conditions not readily classified as solid
are typically classified as intermittent.

The main purpose of the error-analysis function is the
identification of malfunctioning elements and interfaces. Although error
indications show up classified by type and/or elements involved, 
error conditions tend to be reported in multiple. For instance, an 
I/O Control Element that has a problem with a Storage Element 
may report to the Computing Element that there is an I/O Control 
Element as well as a Storage Element problem. Finding the failing 
element or interface then becomes a logical exercise for OEAP. The
program systematizes the techniques traditionally used by human 
beings. 

The analysis routines report their findings to the error-control 
and statistical routine of OEAP. For the benefit of operational and
maintenance personnel, OEAP dynamically reports the condition of
an element when it reported an error check. A count of the number 
and frequency of intermittent errors for each element is main- 
tained. Interface errors are recorded when errors occur in both of 
two elements communicating with each other. These error statistics 
are helpful in deciding whether one or more elements should be re- 
moved from the system. The error count is used by the system op- 
erator to decide whether one or both of two interfacing elements 
must be removed from the system. 

Another of OEAP’s primary functions is to promptly record
error-environment information. When informed that a malfunction
has been detected, the control program receives pertinent
information about the failure. In case of a solid error, OEAP reports that an
element has been deleted; in case of an intermittent error, it must
be decided what further action (usually reconfiguration) should be
undertaken by OEAP. Within a few seconds of the reported error,
relevant information is reported via high-speed printer, typewriter, 
and magnetic tape. 

OEAP has sole responsibility for maintaining the system con- 
figuration; except at initial program loading, it alone executes the 
reconfiguration instruction. Since the configuration control registers 
are not readily accessible to a program, OEAP simulates their con- 
tents in a table that is duplicated in different Storage Elements. 

Whenever an error occurs during an I/O operation, the control
program attempts to execute the operation again by ordinary retry
procedures. OEAP records pertinent retry information, consisting
mainly of sense and status data, as an aid to more efficient
maintenance of I/O channels and devices.

OEAP operates in the Supervisor mode and executes most of the 
privileged instructions. Normally resident in main memory, OEAP 
is directly called into use by every machine-check interruption. 
Moreover, when other interruption conditions indicate that an 
error condition exists, the control program lends control to OEAP.
OEAP also controls all of the interruption program status words in
the alternate Preferential Storage Area (PSA). (Because the
alternate PSA is located precisely 32,768 words in the address range
above the primary PSA, it is located in another Storage Element.
The alternate PSA is referred to automatically, via equipment, when
the Storage Element containing the primary PSA becomes
unavailable.) Some of the editing tasks performed by OEAP are executed in
the problem-program mode after system processing is resumed, 
thus taking less time away from productive processing. 

Unless specifically directed to ignore them, OEAP monitors the 
error conditions reported in non-operational subsystems. The mo- 
tive here is to ensure that the redundant elements being held in 
readiness are in first-class operating condition. Whenever a re- 
dundant element reports a failure, it is reconfigured into an inactive 
state by OEAP.

When OEAP gains control to analyze a reported malfunction,
other productive processing temporarily halts. Part of the OEAP
philosophy is that the Computing Element taking the machine-check
interruption will first attempt to execute OEAP. Other Computing
Elements in the operational system are directed to begin
“time-down” operations of various lengths, the shortest time-down
operation being longer than that needed for a normal OEAP
recovery. Since the majority of errors occurring in the system are
intermittent errors, the Computing Element in which the machine check
originated will probably be successful. But if a time-down is
completed, the associated Computing Element assumes responsibility
for the error analysis and recovery operation.

OEAP requires about 52,000 bytes of main storage, as well as a
system tape that can be used to restore the OEAP program in case
of failure in the Storage Element that contains the OEAP.


Program design 


The Operational Error Analysis Program is designed to take
advantage of the 9020 error-reporting facilities. The diagnose-accessible
register illustrates the type of information OEAP has to work
with. The format of this register is shown in Figure 2. The diagnose- 
accessible register provides detailed interruption source informa- 
tion. A bit in the register is unconditionally set on the receipt of an 
abnormal condition signal. Each bit in the register is individually 
maskable by a corresponding bit in the select register, except for
the bits indicating I/O Control Element information. Since such
information is encoded, one bit in the select register will mask both
I/O Control Element bits. To allow an interruption, a corresponding
bit must be “on” in the select register. The control program
becomes aware of the fact that a bit in the diagnose-accessible
register is set to 1 when both an external interruption occurs and
bit 31 in the external-interruption program status word is 1. OEAP
is then called to analyze the error environment.

Figure 2 Diagnose-accessible register

Figure 3 Relationship of OEAP and system control program
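
The masking rule for the diagnose-accessible register can be
sketched in a few lines of Python. This is a simplified model under
stated assumptions: the bit positions are arbitrary (the Figure 2
layout is not reproduced here), the class name is hypothetical, and
the special case of the two encoded I/O Control Element bits sharing
one select bit is not modeled.

```python
class ComputingElementRegs:
    """Toy model of one Computing Element's error-signal registers."""

    def __init__(self, select=0):
        self.dar = 0          # diagnose-accessible register
        self.select = select  # select (mask) register

    def signal(self, bit):
        """An abnormal-condition signal sets its DAR bit unconditionally."""
        self.dar |= 1 << bit

    def interrupting_bits(self):
        """Only DAR bits whose select bit is on may cause an interruption."""
        return self.dar & self.select


ce = ComputingElementRegs(select=0b0101)  # bits 0 and 2 enabled (arbitrary)
ce.signal(0)   # enabled source: eligible to interrupt
ce.signal(1)   # masked source: recorded in the DAR but not interrupting
```

The sketch shows the two separate effects the text describes: a
signal is always recorded in the register, while the select register
decides whether it is presented to the program as an interruption.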

The main error-reporting vehicle is the machine-check
interruption, which causes immediate activation of OEAP. Machine
checks interrupt a Computing Element when error conditions are 
detected within this element, in the I/O Control Element con- 
trolled by this element, or in the Storage Element being accessed 
when a failure occurs. In the machine-check case, the error envi- 
ronment consists of logout data and Computing Element check 
registers. 

External interruptions are caused by (1) Peripheral Adapter 
Module and Tape Control Unit problems, (2) a Storage Element 
which detected a failure while not being accessed by a Computing 
Element (but being accessed by an I/O Control Element or
not being accessed at all), or (3) one Computing Element informing 
the other Computing Elements that it has taken a machine check. 
When an external interruption occurs, the control program acti- 
vates OEAP. 

Certain equipment error conditions manifest themselves in the 
form of program or I/O interruptions. Such interruptions are
interpreted by the control program; if the control program deems
that an interruption is associated with an error condition, OEAP is
activated. 

The relationship between OEAP and the control program is
suggested by Figure 3, which also introduces the OEAP task names and
flow concepts. From the diagram, it can be seen that there are six
connecting paths between OEAP and the control program, the major
path being the one that links the error control task (called ZOEC)
with the control program. As is obvious from the diagram, ZOEC is
the focal point of OEAP. Once entered, ZOEC controls all subsequent
OEAP activity.

Figure 4 Typical set of OEAP internal interface codes: SE1 has a solid error
and is replaced by SE7

* ID CODE ALWAYS INDICATES ORIGINATING TASK

The OEAP design includes a set of “interface” codes. Whenever
one task transfers control to another, or whenever ZOEC returns
control to the control program, a list of the interface codes is
passed. These codes can indicate analytical findings and they can
specify operations to be performed by the program gaining control.
Each interface code occupies two bytes. The design also makes
provision for passing data independently of the interface codes.
Shown in Figure 4 is a typical sequence of codes that might be
involved when the machine-check analysis task (ZOMX) is entered.
Note that OEAP knows which task is responsible for each list.
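
A list of two-byte interface codes could be represented as in the
following Python sketch. Everything beyond the two-byte width is an
assumption made for illustration: the numeric values, the constant
names, the big-endian byte order, and the placement of the ID code
at the head of the list are all hypothetical, since the actual code
assignments of Figure 4 are not reproduced here.

```python
import struct

# Hypothetical numeric values; the real code assignments are not given here.
ID_ZOMX = 0x0101          # ID code: identifies the originating task
SE_SOLID_ERROR = 0x0201   # finding: a Storage Element has a solid error
REPLACE_ELEMENT = 0x0301  # request: replace the failing element


def pack_codes(codes):
    """Pack a list of interface codes as consecutive two-byte halfwords."""
    return struct.pack(f">{len(codes)}H", *codes)


def unpack_codes(buf):
    """Recover the list of two-byte codes from a packed buffer."""
    return list(struct.unpack(f">{len(buf) // 2}H", buf))


codes = [ID_ZOMX, SE_SOLID_ERROR, REPLACE_ELEMENT]
buf = pack_codes(codes)       # three codes occupy six bytes
decoded = unpack_codes(buf)
```

A receiving task in this model would read the ID code to learn which
task built the list, then act on the findings and requests that
follow.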

Four error analysis tasks are shown in Figure 3: machine-check
analysis (ZOMX), external-interruption analysis (ZOEE), I/O-interruption
analysis (ZOCE), and program interruption analysis (ZOPE).
Each of these tasks analyzes the environment surrounding a re- 
ported error, but only one operates at a given time. Controls are 
built into the program to prevent more than one Computing Ele- 
ment from trying to simultaneously execute the error-analysis tasks. 

ZOMX is entered when one of the active Computing Elements 
takes a machine-check interruption (the Computing Element logs 
its vital registers and control triggers before yielding control). 
ZOMX immediately locks up OEAP by turning on the OEAP “active”
switch and by masking off interruptions that can be masked. The 
“active” switch is a control that prevents more than one Com- 
puting Element from executing the error analysis tasks at the same 
time. 
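
The role of the “active” switch resembles a non-blocking mutual
exclusion lock. The Python sketch below uses `threading.Lock` as a
stand-in; this is an analogy only, since the text does not say how
the switch is implemented on the 9020, and the function names are
hypothetical.

```python
import threading

oeap_active = threading.Lock()  # stands in for the OEAP "active" switch


def try_enter_analysis():
    """Return True if this caller turned the switch on, False if already on."""
    return oeap_active.acquire(blocking=False)


def leave_analysis():
    """Turn the switch off when analysis is finished."""
    oeap_active.release()


first = try_enter_analysis()   # first Computing Element gets in
second = try_enter_analysis()  # a second one finds the switch on
leave_analysis()
third = try_enter_analysis()   # free again after release
leave_analysis()
```

A caller that finds the switch already on does not wait for it here;
in the system described, such a Computing Element takes on the
monitoring role instead.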

A Computing Element obtains error environment data from 
three main sources: a machine check from itself, an I/O Control 
Element, or a Storage Element. To begin its analysis, the Com- 
puting Element looks at the check-register data in its own logout. 
A Computing Element logs itself automatically when it experiences 
a machine check. From these data, the element can determine the 
source or sources of the error condition. There are 24 check-register 
indicators for various logic checks made on the Computing Ele- 
ment. Another 14 indicators are used to indicate conditions in a 
Computing Element or in Computing-to-Storage Element interface 
areas specifically designed for the 9020 system, and four bits serve 
to indicate which Storage Element is having problems from the 
Computing Element’s standpoint. ZOMX examines each of these
indicators and determines the general flow of its subsequent analysis.

If ZOMX determines from the interruption code that an I/O
Control Element malfunction has been reported, the I/O Control
Element’s check registers (which have been automatically
logged by the element) are examined. The I/O Control Element
has 36 internal logic check indicators, of which 12 are included for
special 9020 system logic. From either the Computing Element or
I/O Control Element logout, ZOMX may determine that a Storage
Element ought to be logged also. ZOMX can request a logout from
the appropriate Storage Element. The seven words in a Storage 
Element logout reflect the status of ten check indicators as well as 
pertinent registers. 

From the data in logouts, ZOMX attempts to trace error
indications back to the primary source of trouble. In designing ZOMX and
ZOEE, it was anticipated that faults will normally be reported in
multiple. For instance, if the Storage Element develops trouble in
an I/O Control Element to Storage Element fetch or store, both
I/O Control Element and Storage Element report a difficulty. In
this case, the Storage Element logout suffices to indicate that the
I/O Control Element request reached the Storage Element and
that the Storage Element detected its own logic problem. This
example is, of course, a relatively simple one for ZOMX to interpret.

After a fault has been isolated to an element, the next step is
to determine whether the error is intermittent or solid. ZOMX twice
reads the diagnose-accessible register. If the bit that indicates a
possible source of error is cleared after the first reading, the error is
considered intermittent; if the bit still indicates an error at the
second reading, the error is classified as solid. Hence intermittent
errors occurring in rapid succession may be considered by OEAP as
a solid failure.
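
The two-reading test can be modeled in Python as follows. The
register model is an assumption made for illustration: pulsed
signals are treated as latched bits that a read clears, and level
signals as bits that remain driven. The class, function, and
parameter names are hypothetical.

```python
class ErrorRegister:
    """Toy register: reading clears latched bits; level bits stay driven."""

    def __init__(self, latched=0, level=0):
        self.latched = latched  # bits set by pulsed signals
        self.level = level      # bits held on by a persisting condition

    def read(self):
        value = self.latched | self.level
        self.latched = 0        # a read clears whatever was only latched
        return value


def classify(read_register, bit):
    """Read twice; a bit still on at the second reading is called solid."""
    mask = 1 << bit
    first = read_register()
    second = read_register()
    if first & mask and second & mask:
        return "solid"
    return "intermittent"


pulsed = classify(ErrorRegister(latched=0b100).read, 2)
level = classify(ErrorRegister(level=0b100).read, 2)
```

In this model a pulsed signal that recurred between the two readings
would set the latched bit again and be reported as solid, which
mirrors the caveat in the text about intermittent errors in rapid
succession.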

Even in the absence of machine-check interruptions, the con- 
trol program may decide that an error condition has occurred. Be- 
cause one of the error analysis tasks will be involved in this event, 
a control switch is used to indicate whether OEAP is already being
executed. For most external interruptions, which would require
ZOEE, OEAP will already be in use. In that case, the busy Computing
Element takes on an OEAP monitoring role, a function to be
explained later. ZOEE’s function is to analyze the error environment
created when a Peripheral Adapter Module or Tape Control Unit
reports a problem directly to a Computing Element, when certain
error conditions exist for a Storage Element, or when a Computing
Element cannot recover from its analysis of a machine check. In
each of these cases, ZOEE can look at existing logouts, obtain new
data, or merely utilize information from the diagnose-accessible 
register. 

All analysis tasks proceed somewhat similarly once an error has 
been isolated, i.e., they reveal their identity and generate
appropriate interface codes, thus communicating to ZOEC the cause and
source of the error. If necessary, the code passed to ZOEC indicates
that the source of the error could not be determined. 

When ZOCE is invoked because of failure in an I/O operation, the
error-environment information consists of a table of thirteen words
containing sense and status data, device address, number of retries
attempted, etc. When, for instance, a flight-strip printer is
addressed as an output device, another useful datum is the character
that was being transmitted when the failure occurred plus the three
previous characters. These characters are printed to assist the
maintenance man. ZOCE does not attempt any diagnostic I/O
operations but relies solely on the information presented to it by the
control program.

ZOCE develops and maintains an error history table in which it
stores statistical information concerning each device, control unit,
and interface. Using these data, ZOCE determines when the control
units (Peripheral Adapter Modules and Tape Control Units) and
the I/O Control Elements have generated enough intermittent
failures to deserve replacement. ZOCE uses parametric algorithms
to control these decisions.

ZOPE, the last error-analysis task to be discussed, examines two 
main error conditions: an I/O Control Element that cannot access 
its preferential storage, and a Computing Element that has found 
a Storage Element in a logout-stopped condition. After the control 
program refers such program interruptions to the OEAP for
investigation, ZOPE attempts to restart stopped Storage Elements by
logging them. ZOPE reports its actions and conditions to ZOEC.

Error statistics are kept by three of the OEAP tasks: ZOCE, ZORR
(I/O retry recording), and ZOEC. As mentioned above, ZOCE uses
error statistics in determining when a control unit or I/O Control
Element should be removed from the system. For example, when
one of the adapters (as many as 160 adapters may exist) fails
solidly, the Peripheral Adapter Module must still be allowed to
remain in the system. ZOCE reports the failure by the proper
interface codes, but the Peripheral Adapter Module is charged by ZOEC
with an intermittent error. Particular combinations of adapter
failures may prompt ZOEC to make an independent
recommendation that the Peripheral Adapter Module be removed.

ZORR is a statistical data-gathering task that can be entered by
the control program or by ZOCE. Its function is to record sense and
status data, either for a particular device address on which an I/O
retry is performed or for an I/O operation that was successfully or
unsuccessfully retried. ZORR keeps track of the number of retries
made in each retry sequence, as well as of the number of times each
retry was made with the same response from the device. The
information from ZORR’s table for a device is reported when a retry
succeeds or when the control program abandons its retry attempts.

Most of the error statistics are kept and most of the decisions
concerning those statistics are made by ZOEC. In its error table,
ZOEC counts the intermittent errors in all active system elements.
Whenever an intermittent error is reported, ZOEC not only updates
the count for the appropriate element or interface but also
compares the new total to a pair of thresholds. The first threshold
indicates the point at which the element must be considered as
marginally serviceable. At this point, it must be decided whether the
element should be removed from the system. If a redundant
element can be brought in and a number of other conditions are
favorable, the marginal element is replaced. When the second threshold
is reached by the error count, ZOEC considers the element to have
solidly failed and sets up the appropriate interface codes by which
ZORK (the reconfiguration task) can reconfigure the element from
the system. When OEAP has completed its work, error counts and
thresholds are printed for review by the operator and the
maintenance personnel.
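
The two-threshold bookkeeping can be sketched in Python. The
threshold values, the function name, and the returned labels are
hypothetical; the text does not give the actual parameters, and the
additional conditions that govern replacement of a marginal element
are not modeled.

```python
def record_intermittent(counts, element, marginal, solid):
    """Count one intermittent error and compare the total to two thresholds."""
    counts[element] = counts.get(element, 0) + 1
    total = counts[element]
    if total >= solid:
        return "solid"      # treat as solidly failed; hand to reconfiguration
    if total >= marginal:
        return "marginal"   # candidate for replacement if a spare is available
    return "ok"


counts = {}
# Threshold values 3 and 5 are invented for the illustration.
states = [
    record_intermittent(counts, "SE1", marginal=3, solid=5)
    for _ in range(5)
]
```

With these invented thresholds the element passes from serviceable
to marginal on the third error and is treated as solidly failed on
the fifth.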

In the 9020 system, reconfiguration functions are controlled,
almost entirely, by OEAP. The main exception occurs at system load
time, when an initial program load must be performed before OEAP
is resident in main storage. As soon as OEAP is loaded and supplied
with configuration information, it reconfigures the system in
accordance with its tables. Since the configuration control registers
in the 9020 system elements are not readily available to the
programmer, OEAP maintains configuration tables (in fact, it stores
them in duplicate) to assure that reconfiguration operations are
correctly specified and performed.

Reconfiguration operations in OEAP are performed by ZORK,
which follows the directions forwarded to it by ZOEC. The
commands given to ZORK can require that an element’s configuration
register and address translation register be changed in various ways.
ZORK can delete an element from the system, connect an element to
other elements, or change the element’s state. ZORK checks whether
the configuration register in an element is set by monitoring the
response when the element actually sets its configuration control
register. If the response is not returned, ZORK tries a second time
to set the element’s configuration control register. If a second failure
is noted, ZORK removes that element from the system and attempts
to clear the faulty element’s configuration control register into
“state zero” for maintenance purposes.
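
The set, retry once, then remove sequence can be written compactly.
In this Python sketch the `send` and `remove` callbacks are
hypothetical stand-ins for the reconfiguration instruction and the
element-removal step; the subsequent attempt to force the element
into state zero is left to the `remove` callback.

```python
def set_configuration(element, send, remove):
    """Try to set an element's configuration register, at most twice.

    `send(element)` returns True when the element's response is received.
    `remove(element)` is called after a second failure.
    """
    for _attempt in range(2):
        if send(element):
            return True
    remove(element)
    return False


# Element that never responds: removed after the second attempt.
removed = []
dead = iter([False, False])
ok_dead = set_configuration("SE3", lambda e: next(dead), removed.append)

# Element that responds on the second attempt: kept in the system.
flaky = iter([False, True])
ok_flaky = set_configuration("SE4", lambda e: next(flaky), removed.append)
```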

ZORK ensures that a Computing Element does not delete itself 
or the Storage Element containing OEAP (and thus ZORK itself) from
the system. 

Whenever a subsystem is to be set up for maintenance, data 
analysis, or other uses, ZORK sets up the desired configuration at
manual request. However, any redundant element needed
immediately by the operational system can be recalled by ZORK.

One of OEAP’s basic functions is to document the error
environment through its ZOER and ZOED tasks. The normal recording media
are magnetic tape, high-speed printer, and 1052 typewriter. Data 
in raw form are recorded on magnetic tape as backup for the high- 
speed printer output or for subsequent off-line computer analysis. 
The high-speed printer is used to present the formatted results of 
OEAP’s analysis and data gathering efforts. Header data is followed 
by the formatted logout and OEAP’s current system configuration
data. 

Through ZOER and ZOED, OEAP records other information useful
to the system operator and maintenance man. This information
includes various tables that are normally printed at request, I/O
retry data recorded by ZORR, and the results of system
reconfigurations made at manual request.

The design goals of OEAP assign an important role to the error
analysis monitoring function that is accomplished by ZOBT. It is
interesting to note, though, that a monitoring operation is not
essential for successful recovery. Since the majority of errors (perhaps
nine out of ten) are expected to be intermittent, the Computing
Element taking a machine check should normally make a successful
recovery.

When a Computing Element receives a machine check, as
explained earlier, the OEAP busy switch is set and other active
Computing Elements are forced to take an external interruption.
Recognizing that an error condition exists, the control program branches
to the ZOEE task of OEAP. Because the OEAP busy switch is set,
each Computing Element is forced by the external interruption to
execute ZOBT. If the Computing Element actually executing the
main portion of OEAP cannot recover, the first Computing Element
that starts executing the error monitoring portion of ZOBT takes
responsibility for recovery. This decision is based on elapsed time;
the monitoring Computing Element allows the prime-recovery
Computing Element approximately 350 milliseconds.

ZOBT contains code to overcome the loss of the Storage Element,
called the PSA SE, that stores OEAP. If the PSA SE contents must be
retrieved from magnetic tape, the monitoring Computing Element
allows the prime recovery Computing Element seven seconds for
recovery. When time-down is completed, the monitoring Computing
Element places the prime recovery Computing Element into
the wait state and takes over recovery. If there are other Computing
Elements executing ZOBT, each one in turn can become the
monitoring Computing Element for the new prime recovery
computer.
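
The elapsed-time takeover can be illustrated with a timeout on an
event. This Python sketch is an analogy only: `threading.Event`
stands in for whatever indication of successful recovery the
monitoring Computing Element observes, the names are hypothetical,
and only the two time allowances come from the text.

```python
import threading

NORMAL_TIME_DOWN = 0.350  # seconds allowed for a normal recovery, per the text
TAPE_TIME_DOWN = 7.0      # seconds allowed when OEAP is reloaded from tape


def monitor(recovered, time_down):
    """Wait out the time-down; report which element ends up responsible."""
    if recovered.wait(timeout=time_down):
        return "prime"    # prime-recovery element finished in time
    return "monitor"      # time-down completed: monitor takes over


finished = threading.Event()
finished.set()                                   # recovery already complete
in_time = monitor(finished, NORMAL_TIME_DOWN)

never = threading.Event()                        # recovery never signaled
timed_out = monitor(never, 0.01)                 # short wait for the demo
```

Giving each monitoring element a different time-down, as the text
describes, would let them take over one after another rather than
all at once.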

This explanation of the error analysis monitoring function as- 
sumes that only one Computing Element receives a machine check. 
Actually, whenever any Computing Element begins to execute one 
of the error analysis tasks because of an error, it sets the OEAP 
busy switch and directs all other active Computing Elements to 
execute the error analysis monitoring function. 


Summary comment 

The Operational Error Analysis Program implements the dynamic
on-line error analysis essential to a high-availability multiprocessing
system. Although this OEAP discussion emphasizes the
programming aspect, the authors realize that adequate error-checking
equipment is a prerequisite for advances in programming design.
OEAP’s design heavily depends on the convention that the
Computing Element receiving a machine-check interruption should
attempt recovery. This rule is fine if OEAP is resident in main storage.
For applications with severely limited main storage, however, OEAP
may have to reside in part on disk or drum; then the design
philosophy becomes less desirable because a malfunctioning Computing
Element necessitates an I/O operation before the work of analysis
can start. Depending on the number of Computing Elements, the
nature of the error, and the amount of main storage, trade-offs are
obviously involved.